
Feat: Add api to get machines with leaks#570

Open
srinivasadmurthy wants to merge 72 commits into NVIDIA:main from srinivasadmurthy:sdmrlav2

Conversation

@srinivasadmurthy
Contributor

Description

Type of Change

- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

- [ ] This PR contains breaking changes

Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Additional Notes

Tested by enabling the debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml.
Enabling these generates the relevant health overrides; grpcurl was then used to test the GetHardwareLeaksReport API.

@srinivasadmurthy srinivasadmurthy requested a review from a team as a code owner March 16, 2026 05:46
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@Matthias247 Matthias247 left a comment


I don't know about the exact use-case for this.

But I'd prefer not to add dedicated APIs for searching for specific alert types; instead I'd rather extend the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and would require no new API.

@srinivasadmurthy
Contributor Author

@Matthias247 @kensimon Thanks for your review feedback. I have implemented the suggested changes and am requesting a re-review.

@kensimon
Contributor

I'm going to quote this comment from @srinivasadmurthy to get a discussion going:

This API is for use by RLA. The health monitor in carbide is scraping BMC sensors and detecting compute tray leaks. Once a leak is detected, it places a health override with the Leaks classification. RLA needs to query Carbide for leaking machines periodically, and then act on that. The returned data includes the leaking machine IDs and their current power state. For each machine with a leak, RLA will issue two calls: UpdatePowerOptions to set the desired machine state to OFF, and then AdminPowerControl to switch off the machine. Since this is supposed to respond to leaks reported by the health monitor, it's not a general-purpose search routine. Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting machine IDs first with a filter and then calling GetPowerOptions.
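For illustration, the two-call response flow described in this quote could be sketched roughly as follows. All type and function names here (`LeakReportEntry`, `RlaCall`, `plan_leak_response`) are hypothetical stand-ins, not Carbide's actual API surface:

```rust
// Hypothetical sketch of the RLA-side response to a leak report.

#[derive(Debug, Clone, PartialEq)]
enum PowerState {
    On,
    Off,
}

#[derive(Debug, Clone)]
struct LeakReportEntry {
    machine_id: String,
    power_state: PowerState,
}

#[derive(Debug, PartialEq)]
enum RlaCall {
    UpdatePowerOptions { machine_id: String, desired: PowerState },
    AdminPowerControl { machine_id: String, action: PowerState },
}

/// For each leaking machine that is still powered on, plan the two calls
/// described above: set the desired state to OFF, then power the machine off.
fn plan_leak_response(report: &[LeakReportEntry]) -> Vec<RlaCall> {
    report
        .iter()
        .filter(|e| e.power_state == PowerState::On)
        .flat_map(|e| {
            vec![
                RlaCall::UpdatePowerOptions {
                    machine_id: e.machine_id.clone(),
                    desired: PowerState::Off,
                },
                RlaCall::AdminPowerControl {
                    machine_id: e.machine_id.clone(),
                    action: PowerState::Off,
                },
            ]
        })
        .collect()
}

fn main() {
    let report = vec![
        LeakReportEntry { machine_id: "m-1".into(), power_state: PowerState::On },
        LeakReportEntry { machine_id: "m-2".into(), power_state: PowerState::Off },
    ];
    // m-2 is already off, so only m-1 produces the two calls.
    let calls = plan_leak_response(&report);
    assert_eq!(calls.len(), 2);
    println!("{calls:?}");
}
```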

I really think if the goal here is to respond to leak alerts and shut machines off, having two different layers of polling (having to wait for the health monitor to scrape sensors from a very unreliable BMC API, then having to wait for RLA to pick up the results from the health monitor) is likely not going to be fast enough. You'd have to have an unreasonably fast polling interval to catch the alert in time to do something about it, and the cost of that is likely too much in a larger datacenter with lots of machines.

It seems like it'd be better for health events to stream directly to RLA, so that the instant a health override is added to carbide, it's also forwarded to RLA which can act on it directly, bypassing the polling altogether. Is this something we've thought about?

@zhaozhongn

> (quoting @kensimon's comment above in full)

Yes, that's the long-term intention. In the short term, people were not sure what the health streaming/push mechanism should be, hence we opted for this query model for now. It will still be very useful for other non-handling purposes (e.g., we will check whether any tray in a rack has a leak before turning on the host on a tray). But yes, the handling scenario will switch to a faster method if needed.

srinivasadmurthy and others added 13 commits March 20, 2026 21:17
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Drive by to ensure crates/systemd/src/systemd.rs is parsable on
non-linux systems (macOS...).
This is a no-op for linux systems.

## Type of Change
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [X] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Breaking Changes: NO.

Signed-off-by: Patrice Breton <pbreton@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
… even though there are alerts. (NVIDIA#515)

## Description

Fixes missing `host_machine_id` label in DPU logs and
`telemetry_stats_log_records_count` metric by fetching the id through
carbide API in forge-dpu-agent using the FindInterfaces request. The
label is needed for `SuppressExternalAlerting` to work with the
`noDpuLogsWarning` alert.

The request is retried if the id isn't immediately available, using the
`backon` crate to increase the retry interval to a maximum of every 5
minutes. Adds support for pending file contents to the `duppet` crate.
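The retry behavior described above (an increasing interval capped at 5 minutes, as provided by the `backon` crate) can be sketched as a pure delay schedule. This is an illustrative stand-in for what the backoff builder computes, not the PR's actual code:

```rust
use std::time::Duration;

/// Delay before the nth retry (0-based): base * 2^n, capped at `max`.
/// Mirrors an exponential backoff with a 5-minute ceiling.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

fn main() {
    let base = Duration::from_secs(1);
    let max = Duration::from_secs(300); // 5-minute cap, per the PR description

    assert_eq!(backoff_delay(base, max, 0), Duration::from_secs(1));
    assert_eq!(backoff_delay(base, max, 3), Duration::from_secs(8));
    // Large attempt counts saturate at the cap instead of overflowing.
    assert_eq!(backoff_delay(base, max, 20), Duration::from_secs(300));
    println!("schedule ok");
}
```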

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

https://nvbugspro.nvidia.com/bug/5668278

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Manual testing in local dev to verify
- retry on failure
- `/run/otelcol-contrib/host-machine-id` is created/updated/unchanged as
expected.

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: Tom Erickson <terickson@NVIDIA.COM>
Co-authored-by: Ken Simon <ken@kensimon.io>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This PR adds NVUE telemetry collection for NVLink Switches to the health
service in a new collector:
- NvueRest (HTTP polling)

It is disabled by default and configurable (polling interval, request
timeouts, and enablement of telemetry per path).

Path enablement:
- system_health_enabled: Poll
[/nvue_v1/system/health](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/system/getSystemHealth)
- cluster_apps_enabled: Poll
[/nvue_v1/cluster/apps](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/cluster/getClusterApps)
- sdn_partitions_enabled: Poll
[/nvue_v1/sdn/partition](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/partition/getSdnPartitions)
- interfaces_enabled: Poll
[/nvue_v1/interface](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/interface/getInterface)
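The per-path enablement described above could look roughly like the following. The field names mirror the PR text, but the struct itself and its defaults are an illustrative sketch, not the actual config type:

```rust
use std::time::Duration;

// Hypothetical config for the NvueRest collector; disabled by default,
// with each polled path individually switchable.
#[derive(Debug, Clone)]
struct NvueRestConfig {
    enabled: bool,
    polling_interval: Duration,
    request_timeout: Duration,
    system_health_enabled: bool,  // /nvue_v1/system/health
    cluster_apps_enabled: bool,   // /nvue_v1/cluster/apps
    sdn_partitions_enabled: bool, // /nvue_v1/sdn/partition
    interfaces_enabled: bool,     // /nvue_v1/interface
}

impl Default for NvueRestConfig {
    fn default() -> Self {
        Self {
            enabled: false, // collector is off by default
            polling_interval: Duration::from_secs(60),
            request_timeout: Duration::from_secs(10),
            system_health_enabled: false,
            cluster_apps_enabled: false,
            sdn_partitions_enabled: false,
            interfaces_enabled: false,
        }
    }
}

/// Paths the collector would poll given this config.
fn enabled_paths(cfg: &NvueRestConfig) -> Vec<&'static str> {
    if !cfg.enabled {
        return Vec::new();
    }
    let mut paths = Vec::new();
    if cfg.system_health_enabled { paths.push("/nvue_v1/system/health"); }
    if cfg.cluster_apps_enabled { paths.push("/nvue_v1/cluster/apps"); }
    if cfg.sdn_partitions_enabled { paths.push("/nvue_v1/sdn/partition"); }
    if cfg.interfaces_enabled { paths.push("/nvue_v1/interface"); }
    paths
}

fn main() {
    let cfg = NvueRestConfig {
        enabled: true,
        system_health_enabled: true,
        ..Default::default()
    };
    assert_eq!(enabled_paths(&cfg), vec!["/nvue_v1/system/health"]);
    // Disabled collector polls nothing, regardless of path flags.
    assert!(enabled_paths(&NvueRestConfig::default()).is_empty());
}
```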

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
~~Manual testing needed on rack (and soon to come).~~
Tested and working with `nvue_v1` running on NVOS 25

---------

Signed-off-by: Ivan Anisimov <ianisimov@nvidia.com>
Co-authored-by: Ivan Anisimov <ianisimov@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
<!-- Describe what this PR does -->
part two of dpf sdk refactor. this moves internal state handling largely
to the dpf operator and lets the sdk trigger events when the state
handler loop should act on dpu state changes. still using a custom bfb
config and preloaded systemd services. adds dts as the first crd managed
dpu service. holding back on using the other dpu services (present in
other branch) until we can decide how those should be configured. the
reasoning is not to break existing functionality from milestone 1 as
those dpu services are untested. when all services are moved over, we
should be able to remove the (backported) tera config.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

Closes FORGE-7959

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: fspitulski <fspitulski@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…ot running (NVIDIA#559)

adds "update" to the stop command. This makes sure that supervisord
knows about the dhcp server config before the stop is issued. this is
important on newly provisioned DPUs since the dhcp server config doesn't
exist initially.

Also stops fetching the timestamps when dhcp is not running, to avoid
noise in the logs.

## Description
<!-- Describe what this PR does -->

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [X] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [X] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…n gRPC APIs and a new DB table. (NVIDIA#490)

## Description

SPIFFE JWT-SVID machine identity support: add per-tenant identity
configuration and token delegation via gRPC APIs and a new DB table.

---

## 1. Database Layer (`api-db`)

### New: `crates/api-db/src/tenant_identity_config.rs`

- `TenantIdentityConfig` struct for per-org identity config
- `set()` – upsert identity config (issuer, audiences, TTL, signing key)
- `find()` – fetch config by org
- `delete()` – remove config
- `set_token_delegation()` – set token exchange config (endpoint, auth
method, client secret)
- `delete_token_delegation()` – clear delegation config
- Placeholder key generation (no real encryption yet)

### New:
`crates/api-db/migrations/20260225120000_tenant_identity_config.sql`

- `tenant_identity_config` table with:
  - **Identity:** `issuer`, `default_audience`, `allowed_audiences`, `token_ttl`, `subject_domain_prefix`, `enabled`
  - **Signing:** `encrypted_signing_key`, `signing_key_public`, `key_id`, `algorithm`, `master_key_id`
  - **Timestamps:** `created_at`, `updated_at`
  - **Delegation:** `token_endpoint`, `auth_method`, `encrypted_auth_method_config`, `subject_token_audience`, `token_delegation_created_at`
- FK to `tenants(organization_id)` with `ON DELETE CASCADE`

---

## 2. gRPC API (`rpc`)

### New Proto Messages

- `GetIdentityConfiguration` / `SetIdentityConfiguration` /
`DeleteIdentityConfiguration`
- `GetTokenDelegation` / `SetTokenDelegation` / `DeleteTokenDelegation`
- Messages: `GetIdentityConfigRequest`, `IdentityConfigRequest`,
`IdentityConfigResponse`, `TokenDelegationRequest`,
`TokenDelegationResponse`, `GetTokenDelegationRequest`

---

## 3. API Handlers (`handlers/identity_config.rs`)

### New: `crates/api/src/handlers/identity_config.rs` (657 lines)

- `get_identity_configuration` – read config by org
- `set_identity_configuration` – upsert config with org validation
- `delete_identity_configuration` – delete config
- `get_token_delegation` – read delegation config
- `set_token_delegation` – upsert delegation config
- `delete_token_delegation` – clear delegation

### Helper Functions

- `compute_secret_hash()` – SHA256 hash for secrets
- `truncate_hash_for_display()` – truncate hash for display
- `struct_to_json` / `json_to_struct` – protobuf ↔ JSON
- `build_response_auth_config()` – omit secrets from responses
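A minimal sketch of the display-truncation helper named above. The real `compute_secret_hash()` produces a SHA-256 digest; here we only illustrate the truncate-for-display behavior, and the 8-character prefix length is an assumption, not taken from the PR:

```rust
/// Keep a short hex prefix of a hash followed by an ellipsis, so API
/// responses can identify a secret without exposing the full digest.
/// (Sketch; the real helper's prefix length may differ.)
fn truncate_hash_for_display(hash: &str) -> String {
    const SHOWN: usize = 8; // assumed prefix length
    if hash.len() <= SHOWN {
        hash.to_string()
    } else {
        format!("{}...", &hash[..SHOWN])
    }
}

fn main() {
    let full = "3f786850e387550fdab836ed7e6dc881de23001b";
    assert_eq!(truncate_hash_for_display(full), "3f786850...");
    // Short inputs pass through unchanged.
    assert_eq!(truncate_hash_for_display("abcd"), "abcd");
}
```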

### Unit Tests (10)

- `compute_secret_hash`, `truncate_hash_for_display`
- `struct_to_json`, `json_to_struct`, `json_to_struct_roundtrip`
- `build_response_auth_config_omits_client_secret`, `truncates_hash`,
`passes_through_non_secret`, `non_object_returns_clone`

---

## 4. Configuration (`cfg/file.rs`)

### New: `MachineIdentityConfig`

- `enabled`, `algorithm`, `token_ttl_min`, `token_ttl_max`,
`token_endpoint_http_proxy`
- New `[machine_identity]` section in `CarbideConfig`

---

## 5. Integration Test Support (`api-test-helper`)

### New: `crates/api-test-helper/src/identity_config.rs`

- `set_identity_configuration()`, `get_identity_configuration()`,
`delete_identity_configuration()`
- `set_token_delegation()`, `get_token_delegation()`,
`delete_token_delegation()`
- Uses grpcurl for gRPC calls

---

## 6. Integration Tests (`api-integration-tests`)

### `run_identity_config_tests()` in `tests/lib.rs`

- Runs after tenant creation in `test_integration`
- Sets config → get → delete
- Sets config again → sets token delegation → get delegation → delete
delegation

---

## 7. Fixes

### `api_fixtures/mod.rs`

- Added `machine_identity: MachineIdentityConfig::default()` to
`get_config()` in `CarbideConfig`

---

## 8. Documentation

### `book/src/design/machine-identity/spiffe-svid-sdd.md`

- SDD for SPIFFE JWT-SVID machine identity
- Architecture, config flows, token delegation

---

## Files Changed (identity_config-related)

| File | Change |
|------|--------|
| `crates/api-db/src/tenant_identity_config.rs` | New |
| `crates/api-db/migrations/20260225120000_tenant_identity_config.sql` | New |
| `crates/api/src/handlers/identity_config.rs` | New |
| `crates/api/src/handlers/mod.rs` | Register handler |
| `crates/api/src/handlers/machine_identity.rs` | Modified |
| `crates/api/src/api.rs` | Route new RPCs |
| `crates/api/src/cfg/file.rs` | Add `MachineIdentityConfig` |
| `crates/api/src/tests/common/api_fixtures/mod.rs` | Add `machine_identity` |
| `crates/api-test-helper/src/identity_config.rs` | New |
| `crates/api-test-helper/src/lib.rs` | Export `identity_config` |
| `crates/api-integration-tests/tests/lib.rs` | Add `run_identity_config_tests()` |
| `crates/rpc/proto/forge.proto` | New RPCs and messages |
| `book/src/design/machine-identity/spiffe-svid-sdd.md` | Updated |

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
NVIDIA#447

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
This PR is part of a larger feature implementation related to
NVIDIA#261.

<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Some Lenovo platforms (HS350X V3) do not support SOL over SSH, so for
those we switch to IPMI if site-explorer reports `LenovoAMI` as the BMC
vendor. See:
NVIDIA#528

Note: These machines still report as `Lenovo` in the DMI data.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…NVIDIA#573)

## Description
This PR removes the lenovo-specific `boot_first(Pxe)` from the instance
`invoke_power()` call.

When neither the `boot_with_custom_ipxe` nor the
`run_provisioning_instructions_on_every_boot` flag is set, the user
expects a normal OS reboot, not a forced network boot. Boot-order
verification and correction for network-boot flows is now handled by the
state machine when required, so this lenovo override is unnecessary in
the default reboot path.

This covers the following scenarios:

1. Disk first: the customer OS boots immediately.
2. DPU/network first: the host attempts PXE, then exits/falls through to
the installed OS.
3. Custom install required, handled by the state machine:
a. If network boot is already first, the state handler proceeds with the
install flow.
b. If disk is first, the state handler corrects boot order (making
network/DPU first) and then proceeds with the install flow.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
NVIDIA#530

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
**Problem**: Machines can become unresponsive (e.g., left in BIOS menu,
OS issues, power faults) while they are in the `Ready` state, creating
silent failures that go undetected by carbide until instance creation
attempts fail. A metric was created for this issue but a host's health
state never reflected the timeout, so allocations weren't blocked.

**Fix**: Add scout heartbeat timeout health alert for machines in the
`Ready` state:

- Create a merge health override when `last_scout_contact_time` exceeds
`scout_reporting_timeout` (default 5 minutes)
- Remove the override automatically when scout heartbeat recovers in
Ready
- Clear the scout heartbeat timeout alert whenever the host transitions
out of Ready, so stale alerts do not leak across state changes.
- Continue emitting `hosts_with_scout_heartbeat_timeout` metric
- By default, the alert does not block allocations and suppresses
external alerting. To change this behavior, set the following in the
carbide config:
```
[host_health]
# Set to true to block allocations on hosts with scout heartbeat timeout (default: false)
prevent_allocations_on_scout_heartbeat_timeout = true
# Set to false to include these hosts in the unhealthy hosts Prometheus alert (default: true)
suppress_external_alerting_on_scout_heartbeat_timeout = false
```
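The override lifecycle above can be sketched as a pure check. The names and `MachineState` variants here are illustrative; only the 5-minute default comes from the PR description:

```rust
use std::time::Duration;

// Stand-in for the host's lifecycle state; only Ready matters here.
#[derive(Debug, PartialEq)]
enum MachineState {
    Ready,
    Other,
}

/// An override is warranted only for machines in Ready whose last scout
/// contact is older than the configured timeout. Recovering heartbeat or
/// leaving Ready both clear the alert.
fn needs_heartbeat_override(
    state: &MachineState,
    since_last_contact: Duration,
    timeout: Duration,
) -> bool {
    *state == MachineState::Ready && since_last_contact > timeout
}

fn main() {
    let timeout = Duration::from_secs(300); // default 5 minutes

    // Stale heartbeat in Ready: create the override.
    assert!(needs_heartbeat_override(&MachineState::Ready, Duration::from_secs(400), timeout));
    // Heartbeat recovered in Ready: override is removed.
    assert!(!needs_heartbeat_override(&MachineState::Ready, Duration::from_secs(30), timeout));
    // Transitioned out of Ready: alert cleared regardless of staleness.
    assert!(!needs_heartbeat_override(&MachineState::Other, Duration::from_secs(4000), timeout));
}
```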
## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…VIDIA#371)

## Description

This builds on the [firmware management
work](NVIDIA#323) (and
`ApplyFirmware`) to additionally implement `ApplyProfile` within the DPA
provisioning workflow (it has been stubbed out with placeholders).

The `ApplyProfile` state now handles `mlxconfig` profile management --
resetting the device's `mlxconfig` parameters to factory defaults
between tenancies, and then optionally applying a named
`MlxConfigProfile` if one is configured for the interface. This behavior
of reset + apply updated values is the recommended guidance from NBU.

High level changes include:
1. New `mlxconfig_profile` column on `dpa_interfaces` -- an optional
profile name that maps into `carbide-api`'s `mlxconfig_profiles` config
map.
2. Reworking the `OpCode::ApplyProfile` variant to carry an
`Option<SerializableProfile>` (mirroring how `ApplyFirmware` carries a
`FirmwareFlasherProfile`).
3. `carbide-api`-side config lookup + serialization in
`build_apply_profile_command`.
4. `scout`-side implementation in `mlx_device::apply_profile()`.
5. Corresponding State Controller updates to handle both the reset-only
and reset + profile sync workflows.

In this workflow:
1. We check the interface's `mlxconfig_profile` field.
2. If `None`, we send `ApplyProfile { serialized_profile: None }`, and
`scout` will reset to factory defaults (to prepare for the next tenant)
and report success.
3. If set, we look it up in the `runtime_config.mlxconfig_profiles` map,
serialize it via `SerializableProfile::from_profile()`, and send it down
to `scout`.
4. If the profile name is set, but can't be found in config, we return
an error rather than sending `None` (which would silently reset without
applying any intended profile(s)).
5. `scout` always resets mlxconfig to factory defaults first, then
applies the profile if one was provided, and reports back via
`MlxObservation`.
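The lookup rules in steps 1-4 above can be sketched as a pure function. The types here are simplified stand-ins for the PR's `Option<SerializableProfile>` plumbing and the `mlxconfig_profiles` config map:

```rust
use std::collections::HashMap;

// Simplified stand-in for the serialized mlxconfig profile.
#[derive(Debug, Clone, PartialEq)]
struct SerializableProfile {
    name: String,
}

/// Resolve the interface's `mlxconfig_profile` against the configured map:
/// `None` means reset-only; a configured name must resolve, otherwise we
/// error out rather than silently resetting without the intended profile.
fn resolve_apply_profile(
    interface_profile: Option<&str>,
    profiles: &HashMap<String, SerializableProfile>,
) -> Result<Option<SerializableProfile>, String> {
    match interface_profile {
        None => Ok(None), // factory reset only
        Some(name) => profiles
            .get(name)
            .cloned()
            .map(Some)
            .ok_or_else(|| format!("mlxconfig profile '{name}' not found in config")),
    }
}

fn main() {
    let mut profiles = HashMap::new();
    profiles.insert("hpc".to_string(), SerializableProfile { name: "hpc".into() });

    // No profile configured: reset-only.
    assert_eq!(resolve_apply_profile(None, &profiles), Ok(None));
    // Configured and found: send it down to scout.
    assert!(resolve_apply_profile(Some("hpc"), &profiles).unwrap().is_some());
    // Configured but missing: hard error, never a silent reset.
    assert!(resolve_apply_profile(Some("missing"), &profiles).is_err());
}
```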

The `ApplyProfile` state handler was also broken out into its own
`handle_apply_profile()` function, making it independently testable
without needing the full async state controller scaffolding. I need to
go back and do this in a few other pre-existing places.

Existing tests updated as needed, and new tests introduced.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Update to new name

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [X] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [X] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
<!-- Describe what this PR does -->
delete dpf-sdk crate, rename dpf-sdk-beta crate to dpf-sdk, update
references to dpf-beta to dpf. fix license headers to apache 2.

(old) dpf-sdk crate is already unused. this is just cleanup.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: fspitulski <fspitulski@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
chet and others added 25 commits March 20, 2026 21:17
## Description

This continues the work from
NVIDIA#606,
NVIDIA#602,
NVIDIA#598,
NVIDIA#596,
NVIDIA#608, and
NVIDIA#610.

TLDR is we had leaked a few things from `::rpc` into the `api-db` layer,
which we generally don't want to do, and now that we have
`STYLE_GUIDE.md`, it was good to practice what we preach.

Everything else has been refactored. This handles the last bit of it,
and then kicks `carbide-rpc` out of the `carbide-api-db` crate entirely.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

On numerous occasions, I have wanted to go into the admin UI to look at
DHCP lease information directly, but it's not there.

And I wanted to do it again today. And as usual, it's not there.

And I was like well, if it's not there, we should add it.

But what else should we add?

And then I thought, you know, we probably could use an IPAM section in
general, and have it be a place to look at:
- DHCP allocations (because we have `carbide-dhcp`).
- Authoritative DNS entries (because we have `carbide-dns`).
- Underlay networks/prefixes (because we manage those).
- Overlay networks/prefixes (and we manage those too).

Right now this is just taking care of the DHCP part, and I'm adding
placeholders for DNS and networks.

DHCP details include:
```
struct DhcpEntryDisplay {
    ip_address: String,
    mac_address: String,
    machine_id: String,
    hostname: String,
    created: String,
    last_dhcp: String,
    last_dhcp_rfc3339: String,
}
```

I hope people like this idea, because it's going to make me a lot
happier.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
… to the admin UI (NVIDIA#627)

## Description

Now that we're starting to turn up rack components (`Rack`,
`PowerShelf`, `Switch`) in Carbide, and since we have `ExpectedRack`,
`ExpectedPowerShelf`, and `ExpectedSwitch`, it makes sense to have these
available in the admin UI under the `Rack`, `Power Shelf`, and `Switch`
sections, which right now just have managed ones, and not the expected
details.

This adds all of that, including linked information, and the status of
explored/adopted components.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

In another review, @Matthias247 pointed out that we could/should
probably have a pattern where `Args` can just `.into()` (or
`.try_into()?`) the underlying `Request` that they exist to populate.
Most cases of this are super straightforward, so I'm doing that to some
more of them (in addition to the ones I've already implemented).

At the end of the day, a "command" is now something like..

```rust
let req = args.try_into()?;
let resp = api_client.0.call(req).await?;
// ..do something w/resp
```

...and probably allows us to get even deeper into how we templatize
things in the admin CLI.
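
A minimal sketch of the conversion pattern, with hypothetical stand-in types (`ShowMachineArgs`, `ShowMachineRequest` are illustrative names, not the actual CLI or API types):

```rust
// Hypothetical CLI args type (stand-in for a clap-derived struct).
struct ShowMachineArgs {
    machine_id: String,
}

// Hypothetical request type the API client would accept.
struct ShowMachineRequest {
    machine_id: String,
}

impl TryFrom<ShowMachineArgs> for ShowMachineRequest {
    type Error = String;

    fn try_from(args: ShowMachineArgs) -> Result<Self, Self::Error> {
        // Validation that used to live in the command body moves into
        // the conversion, so the command is just convert-and-call.
        if args.machine_id.is_empty() {
            return Err("machine id is required".to_string());
        }
        Ok(ShowMachineRequest {
            machine_id: args.machine_id,
        })
    }
}

fn main() {
    let args = ShowMachineArgs {
        machine_id: "machine-1".to_string(),
    };
    let req = ShowMachineRequest::try_from(args).expect("valid args");
    assert_eq!(req.machine_id, "machine-1");

    let bad = ShowMachineArgs {
        machine_id: String::new(),
    };
    assert!(ShowMachineRequest::try_from(bad).is_err());
}
```

With this in place, each command body shrinks to the convert-and-call shape shown above, which is what makes further templatizing possible.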

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This new architecture document describes how various network
partitioning technologies (DPUs, IB and NVLink) are integrated into
NICo. It also acts as a guideline that future integrations should
follow.

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

Add mock hardware support for NVIDIA DGX H100 systems including:
- 8x H100 80GB HBM3 GPUs with HGX chassis
- 2x ConnectX-7B quad-port InfiniBand NICs
- 1x ConnectX-7A dual-port storage NIC
- 1x Intel E810 dual-port storage NIC
- 1x Intel X550 management NIC
- 1x BlueField-3 DPU (fixed count of 1)

New NIC hardware modules:
- nic_intel_e810: Intel E810 dual-port NIC
- nic_intel_x550: Intel X550 NIC
- nic_nvidia_cx7: ConnectX-7 variants (CX7A dual-port, CX7B quad-port)

Additional changes:
- Add AMI BMC vendor support with /SD settings path
- Make manager eth_interfaces and firmware_version optional
- Make NIC serial_number optional for Intel NICs without serials
- Add fixed_number_of_dpu() for platforms with fixed DPU count
- Add ok_no_content() helper for PATCH responses returning 204
- Add bmc_redfish_version() per hardware type

## Type of Change
- [x] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Dmitry Porokh <dporokh@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Update the helm/.github CODEOWNERS.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

Add handling for force-delete and link the doc in the style guide

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Currently WorkLockManager is explicitly cancelled with the toplevel
CancellationToken. But handles to it would still exist in things like
the state controllers, which might need to finish their current
iteration before cancelling.

So instead of explicitly cancelling WorkLockManager when the toplevel
cancellation signal is received, let it cancel only after all handles
are dropped (which they should be once all the dependents are done and
drop their handles.)

With this approach, the cancel signal will explicitly cancel the API
listener and all the background controllers, which will drop the
`Arc<Api>` handle, which will free the last handle to WorkLockManager,
which will shut it down. The toplevel JoinSet still requires all tasks
to be complete, so we still rely on all of this happening before the API
server actually shuts down.

(Currently only integration tests use the explicit shutdown signal, but
adding a proper "graceful shutdown on SIGINT" handler can be done
trivially as a followup.)
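
The drop-based shutdown above can be sketched with a small, single-threaded example. `Manager` and `ManagerHandle` are hypothetical stand-ins for WorkLockManager and its handles, not the actual types:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for WorkLockManager: shuts down only when the last handle drops.
struct Manager {
    state: Arc<Mutex<bool>>, // true once shutdown has happened
}

// Stand-in for a dependent's handle (API listener, state controller, ...).
struct ManagerHandle {
    state: Arc<Mutex<bool>>,
}

impl Drop for ManagerHandle {
    fn drop(&mut self) {
        // strong_count == 2 here means only this still-alive handle and
        // the manager itself share the state: we are the last dependent,
        // so trigger the shutdown.
        if Arc::strong_count(&self.state) == 2 {
            *self.state.lock().unwrap() = true;
        }
    }
}

impl Manager {
    fn new() -> Self {
        Manager { state: Arc::new(Mutex::new(false)) }
    }

    fn handle(&self) -> ManagerHandle {
        ManagerHandle { state: Arc::clone(&self.state) }
    }

    fn is_shut_down(&self) -> bool {
        *self.state.lock().unwrap()
    }
}

fn main() {
    let mgr = Manager::new();
    let api_listener = mgr.handle();
    let controller = mgr.handle();

    drop(api_listener);
    assert!(!mgr.is_shut_down()); // a dependent still holds a handle

    drop(controller);
    assert!(mgr.is_shut_down()); // last handle dropped -> shutdown
}
```

The real implementation would do this asynchronously, but the ordering guarantee is the same: cancellation of the dependents releases their handles, and only then does the manager shut down.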

## Type of Change
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [X] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
See discussion in NVIDIA#586 for details.

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Fix typo in DCO section

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Removes the old `DpuSsh` API, and adds a generic `BmcCredentials` API.

For now it only supports the `UsernamePassword` credentials type, but in
the future it should support `SessionTokens` as well.

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [x] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
NVIDIA#460

## Breaking Changes
- [x] This PR contains breaking changes

Removes old DpuSSH credentials API.

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Continuation of
NVIDIA#628. Wanted to
do a bit first, and then do some more here, mainly since it's a lot to
look at.

TLDR is that we have a pattern where `Args` can just `.into()` (or
`.try_into()?`) the underlying `Request` that they exist for. These next
refactorings are also pretty straightforward, but I had punted them to a
separate PR because some of them weren't AS straightforward as the
original ones.

At the end of the day, a "command" is now something like..

```rust
let req = args.try_into()?;
let resp = api_client.0.call(req).await?;
// ..do something w/resp
```

..or in many cases (thanks @poroh for the callout here)..
```rust
api_client.0.call(args).await?;
```

...and probably allows us to get even deeper into how we templatize
things in the admin CLI.

## Description
<!-- Describe what this PR does -->

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This adds an introductory DNS section to the admin UI that includes:
- Zones we are authoritative for.
- The records we currently serve (including record information).

There will be some iterative improvements here, but I want to get
something out for people to work with and look at and go "oh we should
do X and Y instead."

Some improvements would probably be:
- Pagination (either server or client side).
- Filtering (either server or client side).
- Better integration with the global search box (if it doesn't exist
yet).
- Etc.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This introduces an Overlay Networks section to the IPAM section in the
admin UI.

The drilldown is:
- `/admin/ipam/overlay`
  - `/admin/ipam/overlay/prefix/{prefix-id}`
    - `/admin/ipam/overlay/segment/{segment-id}`

When you go to the main **Overlay Networks** page, you get a table with:
- Name
- VNI
- Prefixes

You can then click on one of the prefixes for that VNI, which brings you
to a **Prefix** view showing:
- Prefix
- Name
- Gateway
- Allocated IPs
- Segments (prefix)

You can THEN click on one of the segments in that VNI prefix, which
brings you to a **Segment** view showing:
- Segment
- Parent Prefix
- Table of allocated IPs (and the current instance it's allocated to)

This **ALSO** adds the **Underlay** section, which I was going to keep
separate, and then just decided to make it all one PR.

The drilldown is:
- `/admin/ipam/underlay`
  - `/admin/ipam/underlay/segment/{segment-id}`

I was hoping to reuse the segment view, but since the data is different,
it's two separate ones right now.

When you go to the main **Underlay Networks** view, you get a table
with:
- Name
- Prefix
- Type (admin, underlay, host_inband)
- Gateway
- Allocated IPs

And then when you drill down to the segment, you get the parent prefix
and a table of allocated:
- IP
- MAC
- Machine

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

I don't know how this actually snuck through -- build + tests passed?
All said, it needs to go! This was the old placeholder for the IPAM
admin UI section. Now that all of the sub-sections have been filled in,
this isn't in use, and is causing dead-code errors. Removing
both the struct and its associated HTML file.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [x] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#656)

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#656)

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…#654)

Signed-off-by: Andrew Forgue <aforgue@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…o 90 minutes (NVIDIA#650)

## Description
This state requires the following operations:

- power-cycle the host
- CheckHostConfig
- ConfigureBios & PollingBiosSetup
- SetBootOrder

Power-cycling the host and SetBootOrder are sometimes time-consuming,
depending on the machine configuration. Some machines can take 20
minutes for the first SetBootOrder attempt.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This was discussed at length internally amongst
[DSX](https://nvidianews.nvidia.com/news/nvidia-releases-vera-rubin-dsx-ai-factory-reference-design-and-omniverse-dsx-digital-twin-blueprint-with-broad-industry-support)
and [NCX](https://docs.nvidia.com/ncx/index.html) teams.

While historically [what is now known as] NICo has always generated its
own internal ID for *components*, a rack ID is kind of a grey area
between a component and an identifier. You can think of a rack as a
supercomputer, or something akin to a blade server, BUT, you can also
think of it as a place where components are stored.

With that, we decided the `RackId` should be a `String` whose SoT comes
from the DCIM (Datacenter Inventory Manager) system, since that is how
the BMS (Building Management System) identifies racks, and events for
racks, including things like leak detection.

Instead of NICo generating its own stable `RackId` and maintaining
a mapping between its internal rack ID and the DCIM rack ID, it was
decided the `RackId` should just come from the DCIM as part of "expected
component" ingestion: `ExpectedRack`, `ExpectedMachine`,
`ExpectedPowerShelf`, and `ExpectedSwitch` entities will all be enqueued
with the `RackId` from the DCIM. This ultimately allows for the DCIM
rack ID to be the one and only SoT, and allows for all components in a
DSX AI Factory to agree without confusion from alternative IDs.

So, strip away the hardware-backed `RackId` plans, and move towards a
newtype over `String`, allowing the DCIM to provide whatever it wants.

This change is backwards compatible:
- The database already stores it as text, not a `uuid`, so we don't have to
worry about conversion issues.
- We are moving from an "encoded" value to an open `String` value, so
even pre-existing encoded `RackIds` work with the `String`-backed
newtype.
- The gRPC `common.RackId` is still the same. It was a `String` to begin
with.
- The JSON serialization of it is still the same. We use
`#[serde(transparent)]`, so it's still just `"rack_id":
"whatever-id-you-want"`.

The downside is that now that it's a `String`, it can't be `Copy`, so we
have to `.clone()` in certain cases. I pass it around by reference as
much as I can, but not everywhere.

Also added some tests to ensure:
- The old format still works.
- New strings work.
- Empty strings don't work.
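
A minimal sketch of a `String`-backed newtype with those properties (the names and validation here are illustrative, not the actual NICo definition, and the `#[serde(transparent)]` attribute is omitted to keep the sketch dependency-free):

```rust
// Hypothetical String-backed RackId newtype: the DCIM-provided ID is
// accepted as-is, and only empty strings are rejected.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct RackId(String);

impl TryFrom<String> for RackId {
    type Error = &'static str;

    fn try_from(s: String) -> Result<Self, Self::Error> {
        if s.is_empty() {
            Err("rack id must not be empty")
        } else {
            Ok(RackId(s))
        }
    }
}

impl std::fmt::Display for RackId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(&self.0)
    }
}

fn main() {
    // Pre-existing encoded values and new free-form DCIM strings both pass.
    assert!(RackId::try_from("rack-0001".to_string()).is_ok());
    assert!(RackId::try_from("DC1/ROW2/RACK7".to_string()).is_ok());

    // Empty strings are rejected.
    assert!(RackId::try_from(String::new()).is_err());
}
```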

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#661)

## Description
machine-a-tron now supports a handful of different hardware types, so it
is useful to see the hardware type of a specific machine in the TUI.

## Type of Change
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Dmitry Porokh <dporokh@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

Don't need these. Noticed them when doing some other work.

Longer technical explanation for those who don't know is that
`.to_string()` takes `&self`, and creates a new `String` from it, so
we're making a `.clone()` of something that would just be taken as a
reference anyway.
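
A minimal illustration of the redundancy (variable names are made up for the example):

```rust
fn main() {
    let id = String::from("machine-42");

    // Redundant: `to_string` takes `&self` and allocates a new String,
    // so the explicit `.clone()` just allocates a throwaway copy first.
    let a = id.clone().to_string();

    // Equivalent result, one allocation fewer:
    let b = id.to_string();

    assert_eq!(a, b);
}
```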

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…A#620)

## Description
Fixes a gap where extension services marked for removal were not cleaned
up from instance config even after all DPUs reported successful
termination.

Previously, terminated extension service cleanup was only executed in
the `WaitingForExtensionServicesConfig` instance state; however,
extension service config updates do not transition the machine out of
Ready (only tenant state moves to `Configuring`). This change adds
instance extension service config cleanup in the `Ready` state so
terminated services can be cleaned up.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Signed-off-by: Felicity Xu <hanyux@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
@Matthias247
Contributor

something is broken with this PR. It contains a lot of changes. Probably rebase gone wrong.

@srinivasadmurthy
Contributor Author

Yes. I had not signed the commits, and followed GitHub's recommendations to sign them (which included a rebase). That seems to have messed up the PR. I am going to create a new branch, open a new PR, and discard this one.
